library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2     ✓ purrr   0.3.3
✓ tibble  3.0.3     ✓ dplyr   1.0.1
✓ tidyr   1.1.1     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(cluster)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dendextend)
---------------------
Welcome to dendextend version 1.14.0
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
Or contact: <tal.galili@gmail.com>
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: ‘dendextend’
The following object is masked from ‘package:stats’:
cutree
Use k-means clustering to investigate potential relationships in the dataset computers.csv, which contains information on computer sales. We are interested in the relationship between hd (hard drive size) and ram (amount of computer memory):
computers <- read_csv("computers.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  price = col_double(),
  speed = col_double(),
  hd = col_double(),
  ram = col_double(),
  screen = col_double(),
  cd = col_character(),
  multi = col_character(),
  premium = col_character(),
  ads = col_double(),
  trend = col_double()
)
glimpse(computers)
Rows: 6,259
Columns: 11
$ X1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ price   <dbl> 1499, 1795, 1595, 1849, 3295, 3695, 1720, 1995, 2225, 2575, 2…
$ speed   <dbl> 25, 33, 25, 25, 33, 66, 25, 50, 50, 50, 33, 66, 50, 25, 50, 5…
$ hd      <dbl> 80, 85, 170, 170, 340, 340, 170, 85, 210, 210, 170, 210, 130,…
$ ram     <dbl> 4, 2, 4, 8, 16, 16, 4, 2, 8, 4, 8, 8, 4, 8, 8, 4, 2, 4, 4, 8,…
$ screen  <dbl> 14, 14, 15, 14, 14, 14, 14, 14, 14, 15, 15, 14, 14, 14, 14, 1…
$ cd      <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no", "no", …
$ multi   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ premium <chr> "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes",…
$ ads     <dbl> 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 9…
$ trend   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
summary(computers)
X1 price speed hd
Min. : 1 Min. : 949 Min. : 25.00 Min. : 80.0
1st Qu.:1566 1st Qu.:1794 1st Qu.: 33.00 1st Qu.: 214.0
Median :3130 Median :2144 Median : 50.00 Median : 340.0
Mean :3130 Mean :2220 Mean : 52.01 Mean : 416.6
3rd Qu.:4694 3rd Qu.:2595 3rd Qu.: 66.00 3rd Qu.: 528.0
Max. :6259 Max. :5399 Max. :100.00 Max. :2100.0
ram screen cd multi
Min. : 2.000 Min. :14.00 Length:6259 Length:6259
1st Qu.: 4.000 1st Qu.:14.00 Class :character Class :character
Median : 8.000 Median :14.00 Mode :character Mode :character
Mean : 8.287 Mean :14.61
3rd Qu.: 8.000 3rd Qu.:15.00
Max. :32.000 Max. :17.00
premium ads trend
Length:6259 Min. : 39.0 Min. : 1.00
Class :character 1st Qu.:162.5 1st Qu.:10.00
Mode :character Median :246.0 Median :16.00
Mean :221.3 Mean :15.93
3rd Qu.:275.0 3rd Qu.:21.50
Max. :339.0 Max. :35.00
computer_ram <- computers %>%
select(c(hd, ram))
computer_ram
Lots of data here; let's use ggplot to get some visualisation:
ggplot(computer_ram, aes(hd, ram)) +
geom_point()
The data appears to already be grouped quite distinctly into horizontal bands, one for each distinct value of ram. So although we could say this is good for clustering, it's not clear there is much more we could learn from this.
# We need to scale this data, as there are some big numbers here!
# scale() standardises each column as (x - mean(x)) / sd(x);
# an alternative is min-max normalisation, (x - min(x)) / (max(x) - min(x))
computer_ram_scale <- computer_ram %>%
mutate_all(scale)
computer_ram_scale
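The min-max normalisation mentioned above can be sketched in R too. This is illustrative only (the analysis below sticks with the `scale()` version), and `min_max` is a helper name made up here:

```r
# Hypothetical helper: rescale a numeric vector to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

computer_ram_minmax <- computer_ram %>%
  mutate_all(min_max)
```

Either way, the two variables end up on comparable scales, which stops hd (with values in the hundreds) from dominating the Euclidean distances that k-means uses.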
From the flat, banded relationships showing, k = 6 seems like the most logical choice, as there are 6 distinct lines in the plot.
clustered_comp_ram <- kmeans(computer_ram_scale, centers = 6, nstart = 25)
clustered_comp_ram
K-means clustering with 6 clusters of sizes 316, 315, 1610, 2364, 821, 833
Cluster means:
hd ram
1 2.5345722 2.8488519
2 2.0993943 0.6999743
3 -0.2000123 -0.1202268
4 -0.8248092 -0.8186906
5 0.4603185 1.3697243
6 0.5182632 -0.1396435
Clustering vector:
[1] 4 4 4 3 5 5 4 4 3 4 3 3 4 3 3 4 4 4 4 3 4 4 5 4 3 4 4 3 5 3 4 4 3 4 4 4 3
[38] 6 3 3 3 6 3 3 5 3 4 4 4 4 4 4 3 4 3 4 4 4 3 4 3 4 3 4 4 4 4 4 4 3 6 4 4 3
[75] 4 3 3 3 4 4 4 3 3 3 3 4 3 4 4 3 4 4 3 3 5 4 4 4 5 4 4 3 4 4 4 3 3 4 4 4 3
[112] 4 3 5 4 4 4 3 4 4 4 4 3 3 3 3 4 3 4 4 4 4 4 4 4 3 3 4 3 4 3 6 3 4 3 3 4 4
[149] 3 6 4 3 3 3 3 3 4 3 4 4 4 4 4 3 5 4 3 4 4 4 3 3 4 3 3 4 4 3 4 4 4 4 4 3 6
[186] 3 4 3 5 5 4 4 3 4 4 3 4 4 4 3 3 3 3 4 4 4 3 3 4 6 3 3 3 4 3 4 3 4 5 4 3 3
[223] 3 3 3 3 4 5 4 3 4 3 4 3 3 4 4 4 4 6 4 4 3 3 3 3 5 3 4 4 3 4 3 4 4 4 3 4 4
[260] 5 3 3 3 3 4 4 3 3 5 4 4 6 3 5 3 5 4 3 3 4 3 4 4 4 3 3 4 3 5 4 3 4 3 3 4 5
[297] 3 5 4 4 4 4 4 5 3 4 3 3 3 3 3 3 5 3 3 4 4 3 3 4 3 5 5 3 4 4 3 4 5 4 4 4 4
[334] 4 4 5 4 5 3 3 4 3 3 4 3 3 4 3 3 4 3 3 3 4 4 4 5 3 3 3 4 4 3 3 3 4 4 3 4 3
[371] 3 4 3 3 4 4 5 5 4 4 3 3 5 3 3 3 4 4 4 4 4 3 4 4 4 5 4 4 4 3 3 4 3 5 5 4 3
[408] 3 3 5 4 4 4 3 5 4 3 4 3 4 5 5 4 3 3 4 5 3 4 3 5 5 5 4 3 4 3 4 3 5 3 3 4 4
[445] 3 5 3 4 4 3 4 3 3 3 3 3 3 4 3 5 3 4 6 5 4 4 3 4 3 6 3 4 3 3 4 4 3 4 3 4 3
[482] 4 4 4 3 3 4 3 3 3 2 3 4 3 3 3 5 4 4 3 4 4 4 3 3 4 3 5 3 6 3 3 5 4 4 3 4 4
[519] 4 4 4 3 4 3 4 5 3 4 4 3 3 4 5 3 4 3 4 4 3 4 5 3 4 3 3 3 4 3 4 4 3 4 5 4 5
[556] 4 3 5 5 4 5 4 4 3 3 4 4 4 4 3 4 3 4 4 4 3 4 5 3 4 4 3 6 3 6 4 5 4 4 4 3 4
[593] 5 4 4 4 6 3 4 4 3 5 4 5 4 4 3 4 3 4 5 6 3 3 4 2 4 3 4 4 3 4 3 3 4 6 3 4 4
[630] 3 4 4 4 4 4 4 3 4 4 3 4 3 4 5 4 5 5 3 3 4 3 3 3 3 4 3 3 4 3 3 5 4 3 4 4 6
[667] 5 4 4 3 3 5 6 5 3 3 3 4 3 4 3 4 3 6 4 3 6 6 3 6 4 3 4 5 4 4 3 4 5 4 4 4 3
[704] 5 4 4 5 4 4 4 5 3 4 3 4 4 5 5 6 3 2 4 3 4 4 4 3 5 6 4 4 4 3 4 4 4 4 3 5 5
[741] 5 4 4 3 4 5 3 4 4 3 4 3 4 5 3 3 4 3 4 4 4 4 4 5 4 4 5 4 4 3 3 4 5 5 5 3 4
[778] 4 5 3 4 4 5 3 3 3 4 3 4 5 4 5 4 3 4 3 5 4 5 3 4 5 4 3 4 3 4 3 3 4 3 4 4 5
[815] 4 4 3 4 4 4 3 3 3 3 5 4 4 4 4 5 6 4 4 3 4 4 5 3 4 3 6 3 6 5 3 4 4 4 4 4 4
[852] 3 3 4 5 3 4 4 4 3 3 4 4 4 6 4 4 3 3 3 4 4 3 4 4 3 3 4 4 3 3 5 4 3 5 4 4 5
[889] 4 4 3 5 4 4 3 4 5 5 4 3 2 3 5 4 3 4 4 4 5 4 5 4 4 3 4 4 4 4 4 4 3 4 3 3 4
[926] 4 3 3 3 4 4 5 4 4 4 3 3 3 4 4 3 5 4 4 6 3 4 3 4 4 5 6 4 4 5 3 4 3 3 5 3 4
[963] 3 4 3 5 4 4 4 3 3 3 3 4 4 3 4 4 4 4 4 2 5 3 3 3 4 4 4 4 4 4 4 2 5 3 3 4 3
[1000] 4
[ reached getOption("max.print") -- omitted 5259 entries ]
Within cluster sum of squares by cluster:
[1] 189.37178 246.58534 211.79695 242.69113 168.14783 85.39715
(between_SS / total_SS = 90.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
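Any of these components can be pulled straight off the kmeans object with `$`, for example:

```r
clustered_comp_ram$size    # number of observations in each cluster
clustered_comp_ram$centers # cluster means on the scaled variables
clustered_comp_ram$iter    # iterations until the centroids stopped moving
```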
Let's use broom to get information on each cluster first:
library(broom)
tidy(clustered_comp_ram) # gives info on each cluster as a whole
glance(clustered_comp_ram) # info on how well it's clustered
The above took 2 iterations before the centroids stopped moving, although this could change between runs.
augment(clustered_comp_ram, computer_ram) #says what's in each cluster
# We can use animation. Don't show every frame in a presentation for clients: they tend to just want to see the result, not how we got there, so the last frame would be fine.
library(animation)
computer_ram_scale %>%
kmeans.ani(centers = 6)
Error in if (all(centers == ocenters)) break :
missing value where TRUE/FALSE needed
Here the clustering appears to work well, despite initial reservations. The centroids are distinct although 2 of the clusters above look very close at the bottom left of the graphic as things stand.
glance(clustered_comp_ram)
How do we pick k?
Method 1: Elbow method
Method 2: Silhouette coefficient
Method 3: Gap statistic
Method 1: try for k = 1 to 6
max_k <- 6
k_clusters <- tibble(k = 1:max_k) %>%
mutate(
kclust = map(k, ~kmeans(computer_ram_scale, .x, nstart = 25)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, computer_ram)
)
k_clusters
clusterings <- k_clusters %>%
unnest(glanced)
ggplot(clusterings, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(1, 6, by = 1))
Above, k = 2 seems a better fit using the elbow method.
library(factoextra)
fviz_nbclust(computer_ram_scale, kmeans, method = "wss", nstart = 25)
Again, k = 2 is showing as the best fit. Now try the silhouette coefficient
fviz_nbclust(computer_ram_scale, kmeans, method = "silhouette", nstart = 25)
Here, k = 10 is looking like the best fit.
Now let's try the gap statistic:
fviz_nbclust(computer_ram_scale, kmeans, method = "gap_stat", nstart = 25)
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.
did not converge in 10 iterations
..........
Quick-TRANSfer stage steps exceeded maximum (= 312950)
...
did not converge in 10 iterations
.
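The "did not converge in 10 iterations" warnings come from `kmeans()`'s default of `iter.max = 10`. A common workaround, sketched here under the assumption that `fviz_nbclust()` forwards extra arguments on to the clustering function, is to raise `iter.max`:

```r
# Sketch: same gap-statistic call, but allowing kmeans more iterations
fviz_nbclust(computer_ram_scale, kmeans,
             method = "gap_stat", nstart = 25,
             iter.max = 50) # up from the default of 10
```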
We can also visualise via ggplot; let's try this:
clusterings %>%
unnest(augmented) %>%
filter(k <= 6) %>%
ggplot(aes(x = hd, y = ram)) +
geom_point(aes(colour = .cluster))
Let's try our newly chosen value of k = 2 now:
clusterings %>%
  unnest(cols = c(augmented)) %>%
  filter(k == 2) %>%
  add_count(.cluster, name = "n_obs") %>% # num of obs. in each cluster
  ggplot(aes(x = hd, y = ram, colour = .cluster)) +
  geom_point() +
  geom_text(aes(label = n_obs), hjust = 0, vjust = -0.5, size = 3)
We can then extract the clusters to do descriptive statistics at the level of the cluster
clusterings %>%
unnest(augmented) %>%
filter(k == 2) %>%
group_by(.cluster) %>%
summarise(mean(hd), mean(ram))
`summarise()` ungrouping output (override with `.groups` argument)
There is a large difference between the cluster means, which is a good sign that the clusters are well separated.
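As a final sanity check, here is a sketch of inspecting the silhouette widths for the k = 2 solution directly, using the cluster and factoextra packages loaded earlier. Note that `dist()` builds the full pairwise distance matrix for all 6,259 rows, so this can be slow and memory-hungry:

```r
# Refit k = 2 and compute a silhouette width for every observation
k2 <- kmeans(computer_ram_scale, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(computer_ram_scale))

summary(sil)         # average silhouette width, overall and per cluster
fviz_silhouette(sil) # silhouette plot for each observation
```

Silhouette widths close to 1 would back up the conclusion above that the two clusters are well separated.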